林嶔 (Lin, Chin)
Lesson 5
Up to today our lessons focused mainly on theoretical derivations; from now on we will emphasize practical development, using plenty of experiments to illustrate some of the harder mathematical ideas.
Let's use a package to develop artificial neural networks; this lets us quickly experiment with how different network architectures affect classification results.
– From the derivations in the previous lessons, we should see that such a development framework is actually not hard to build: no matter how deep or complex an artificial neural network is, it can always be defined as a composition of functions, and by differentiating each piece and applying the chain rule we can obtain the gradient of any parameter, which lets us train the network with gradient descent.
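– To see this idea in miniature, here is a tiny base-R sketch (no MxNet involved; all names are introduced here for illustration) that differentiates a small composed function by the chain rule and checks the result against a finite difference:
sigmoid = function(z) {1 / (1 + exp(-z))}
w = 0.8; x = 1.5
h = sigmoid(w * x)                                  # forward pass: f(w) = sigmoid(w * x)^2
# chain rule: df/dw = df/dh * dh/dl * dl/dw = (2 * h) * h * (1 - h) * x
grad_chain = 2 * h * h * (1 - h) * x
delta = 1e-6                                        # numerical check by finite differences
grad_num = (sigmoid((w + delta) * x)^2 - sigmoid(w * x)^2) / delta
c(chain_rule = grad_chain, numerical = grad_num)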
– Note, however, that MxNet can only be installed on 64-bit operating systems.
– Its installation is a bit unusual, and a GPU version is also available; below is how to install the CPU version on Windows.
– If your version of R is 3.5 or earlier, you may want to install it with the following syntax:
cran <- getOption("repos")
cran["dmlc"] <- "https://s3-us-west-2.amazonaws.com/apache-mxnet/R/CRAN/"
options(repos = cran)
install.packages("mxnet")
– If your version of R is 3.6 or later, you may want to install it with this syntax:
install.packages("https://s3.ca-central-1.amazonaws.com/jeremiedb/share/mxnet/CPU/3.6/mxnet.zip", repos = NULL)
– Alternatively, for R 3.6 or later you can also try installing it with this syntax:
cran <- getOption("repos")
cran["dmlc"] <- "https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/R/CRAN/"
options(repos = cran)
install.packages("mxnet")
library(mxnet)
Data structure (Iterator)
Prediction function / network architecture (Model architecture)
Optimization method (Optimizer)
– Pay attention to the dimensions we define: in the MxNet framework, the last dimension is always the sample size (see the quick check after the simulation code below).
set.seed(0)
X1 = rnorm(1000)
X2 = rnorm(1000)
Y = X1 * 0.7 + X2 * 1.3 - 3.1 + rnorm(1000)
X.array = array(rbind(X1, X2), dim = c(2, 1000))
Y.array = array(Y, dim = c(1, 1000))
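– As promised above, a quick sanity check (illustrative only) confirms that the sample size (1000) sits in the last dimension of both arrays:
dim(X.array)  # 2 1000
dim(Y.array)  # 1 1000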
my_iterator_core = function(batch_size) {
batch = 0
batch_per_epoch = length(Y.array)/batch_size
reset = function() {batch <<- 0}
iter.next = function() {
batch <<- batch+1
if (batch > batch_per_epoch) {return(FALSE)} else {return(TRUE)}
}
value = function() {
idx = 1:batch_size + (batch - 1) * batch_size
idx[idx > ncol(X.array)] = sample(1:ncol(X.array), sum(idx > ncol(X.array)))
data = mx.nd.array(X.array[,idx, drop=FALSE])
label = mx.nd.array(Y.array[,idx, drop=FALSE])
return(list(data = data, label = label))
}
return(list(reset = reset, iter.next = iter.next, value = value, batch_size = batch_size, batch = batch))
}
my_iterator_func <- setRefClass("Custom_Iter",
fields = c("iter", "batch_size"),
contains = "Rcpp_MXArrayDataIter",
methods = list(
initialize = function(iter, batch_size = 100){
.self$iter <- my_iterator_core(batch_size = batch_size)
.self
},
value = function(){
.self$iter$value()
},
iter.next = function(){
.self$iter$iter.next()
},
reset = function(){
.self$iter$reset()
},
finalize=function(){
}
)
)
my_iter = my_iterator_func(iter = NULL, batch_size = 20)
my_iter$reset()
my_iter$iter.next()
## [1] TRUE
my_iter$value()
## $data
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.2629542 -0.3262334 1.3297993 1.272429 0.4146414 -1.53995001
## [2,] -0.2868516 1.8411069 -0.1567643 -1.389803 -1.4731040 -0.06951893
## [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] -0.9285671 -0.2947204 -0.005767173 2.404653 0.7635934 -0.7990093
## [2,] 0.2392414 0.2504199 -0.264423937 -1.975400 -0.4498760 0.9264656
## [,13] [,14] [,15] [,16] [,17] [,18]
## [1,] -1.147657 -0.2894616 -0.2992151 -0.4115108 0.2522235 -0.8919211
## [2,] -2.319971 0.6139336 -1.4731401 -0.2190492 -1.4034163 0.8206284
## [,19] [,20]
## [1,] 0.4356833 -1.2375385
## [2,] -0.5927263 0.4219688
##
## $label
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] -2.144493 -0.922995 -2.382214 -4.31842 -4.232431 -4.871059 -4.12126
## [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] -2.694021 -3.282202 -4.826817 -3.972437 -3.884252 -6.784581 -1.874525
## [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] -4.103997 -2.750563 -2.639356 -1.861134 -6.239434 -3.645907
– The function 「mx.symbol.Variable」 creates a new starting point for the model; its name usually matches an output of the Iterator, and sometimes it is also used to name a weight.
– The function 「mx.symbol.FullyConnected」 is the fully connected layer, which corresponds to the linear function \(L\) in our earlier MLP; the parameter 「num.hidden」 specifies the output dimension.
data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')
weight = mx.symbol.Variable(name = 'weight')
bias = mx.symbol.Variable(name = 'bias')
pred_y = mx.symbol.FullyConnected(data = data, weight = weight, bias = bias,
num.hidden = 1, no.bias = FALSE, name = 'pred_y')
mx.symbol.infer.shape(pred_y, data = c(2, 20))
## $arg.shapes
## $arg.shapes$data
## [1] 2 20
##
## $arg.shapes$weight
## [1] 2 1
##
## $arg.shapes$bias
## [1] 1
##
##
## $out.shapes
## $out.shapes$pred_y_output
## [1] 1 20
##
##
## $aux.shapes
## named list()
residual = mx.symbol.broadcast_minus(lhs = label, rhs = pred_y)
square_residual = mx.symbol.square(data = residual)
mean_square_residual = mx.symbol.mean(data = square_residual, axis = 0, keepdims = FALSE)
mse_loss = mx.symbol.MakeLoss(data = mean_square_residual, name = 'rmse')
mx.symbol.infer.shape(mse_loss, data = c(2, 20), label = c(1, 20))$out.shapes
## $rmse_output
## [1] 1
my_optimizer = mx.opt.create(name = "adam", learning.rate = 0.001, beta1 = 0.9, beta2 = 0.999,
epsilon = 1e-08, wd = 0)
my_optimizer = mx.opt.create(name = "sgd", learning.rate = 0.05, momentum = 0.9, wd = 0)
– Note that num.round here means the number of epochs, not the number of batches.
mx.set.seed(0)
lr_model = mx.model.FeedForward.create(symbol = mse_loss, X = my_iter, optimizer = my_optimizer,
array.batch.size = 20, ctx = mx.cpu(), num.round = 10)
lr_model$arg.params
## $weight
## [,1]
## [1,] 0.6362522
## [2,] 1.3090633
##
## $bias
## [1] -3.322609
lm(Y ~ X1 + X2)
##
## Call:
## lm(formula = Y ~ X1 + X2)
##
## Coefficients:
## (Intercept) X1 X2
## -3.0312 0.7117 1.3192
my.eval.metric.loss <- mx.metric.custom(
name = "MSE-loss",
function(real, pred) {
return(as.array(pred))
}
)
my_logger = mx.metric.logger$new()
lr_model = mx.model.FeedForward.create(symbol = mse_loss, X = my_iter, optimizer = my_optimizer,
array.batch.size = 20, ctx = mx.cpu(), num.round = 10,
eval.metric = my.eval.metric.loss,
epoch.end.callback = mx.callback.log.train.metric(1, my_logger))
plot(my_logger$train, type = 'l', ylab = 'MSE loss')
– Here is a simulated logistic regression sample; try to write an MxNet program that solves for its weight parameters.
set.seed(0)
X1 = rnorm(1000)
X2 = rnorm(1000)
LR = X1 * 2 + X2 * 3 + rnorm(1000)
PROP = 1/(1+exp(-LR))
Y = as.integer(PROP > runif(1000))
X.array = array(rbind(X1, X2), dim = c(2, 1000))
Y.array = array(Y, dim = c(1, 1000))
– This is the result of the 「glm」 function:
glm(Y ~ X1 + X2, family = 'binomial')
##
## Call: glm(formula = Y ~ X1 + X2, family = "binomial")
##
## Coefficients:
## (Intercept) X1 X2
## -0.01483 1.91820 2.70523
##
## Degrees of Freedom: 999 Total (i.e. Null); 997 Residual
## Null Deviance: 1385
## Residual Deviance: 654.9 AIC: 660.9
\[ \begin{align} log(\frac{{p}}{1-p}) & = b_{0} + b_{1}x \\ p & = \frac{{1}}{1+e^{-b_{0} - b_{1}x}} \\ loss & = CE(y, p) = -\frac{{1}}{n}\sum \limits_{i=1}^{n} \left(y_{i} \cdot log(p_{i}) + (1-y_{i}) \cdot log(1-p_{i})\right) \end{align} \]
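– As a quick side check in base R (not part of the exercise; glm_fit and p_hat are names introduced here), the cross-entropy defined above can be computed directly from the 「glm」 fit shown earlier:
glm_fit = glm(Y ~ X1 + X2, family = 'binomial')      # the same fit as shown above
p_hat = predict(glm_fit, type = 'response')          # fitted probabilities p_i
CE = -mean(Y * log(p_hat) + (1 - Y) * log(1 - p_hat))
CE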
my_iterator_core = function(batch_size) {
batch = 0
batch_per_epoch = length(Y.array)/batch_size
reset = function() {batch <<- 0}
iter.next = function() {
batch <<- batch+1
if (batch > batch_per_epoch) {return(FALSE)} else {return(TRUE)}
}
value = function() {
idx = 1:batch_size + (batch - 1) * batch_size
idx[idx > ncol(X.array)] = sample(1:ncol(X.array), sum(idx > ncol(X.array)))
data = mx.nd.array(X.array[,idx, drop=FALSE])
label = mx.nd.array(Y.array[,idx, drop=FALSE])
return(list(data = data, label = label))
}
return(list(reset = reset, iter.next = iter.next, value = value, batch_size = batch_size, batch = batch))
}
my_iterator_func <- setRefClass("Custom_Iter",
fields = c("iter", "batch_size"),
contains = "Rcpp_MXArrayDataIter",
methods = list(
initialize = function(iter, batch_size = 100){
.self$iter <- my_iterator_core(batch_size = batch_size)
.self
},
value = function(){
.self$iter$value()
},
iter.next = function(){
.self$iter$iter.next()
},
reset = function(){
.self$iter$reset()
},
finalize=function(){
}
)
)
my_iter = my_iterator_func(iter = NULL, batch_size = 20)
data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')
linear_pred = mx.symbol.FullyConnected(data = data,
num.hidden = 1, no.bias = FALSE, name = 'linear_pred')
logistic_pred = mx.symbol.sigmoid(data = linear_pred, name = 'logistic_pred')
eps = 1e-8
ce_loss_pos = mx.symbol.broadcast_mul(mx.symbol.log(logistic_pred + eps), label)
ce_loss_neg = mx.symbol.broadcast_mul(mx.symbol.log(1 - logistic_pred + eps), 1 - label)
ce_loss_mean = 0 - mx.symbol.mean(ce_loss_pos + ce_loss_neg)
ce_loss = mx.symbol.MakeLoss(ce_loss_mean, name = 'ce_loss')
my_optimizer = mx.opt.create(name = "sgd", learning.rate = 0.05, momentum = 0.9, wd = 0)
mx.set.seed(0)
logistic_model = mx.model.FeedForward.create(symbol = ce_loss, X = my_iter, optimizer = my_optimizer,
array.batch.size = 20, ctx = mx.cpu(), num.round = 100)
logistic_model$arg.params
## $linear_pred_weight
## [,1]
## [1,] 1.842254
## [2,] 2.765283
##
## $linear_pred_bias
## [1] 0.0406286
– In the previous exercise we already wrote the three parts of logistic regression: the Iterator, the Model architecture, and the Optimizer. Now let's write the Executor that ties them together.
my_executor = mx.simple.bind(symbol = ce_loss,
data = c(2, 20), label = c(1, 20),
ctx = mx.cpu(), grad.req = "write")
mx.set.seed(0)
new_arg = mxnet:::mx.model.init.params(symbol = ce_loss,
input.shape = list(data = c(2, 20), label = c(1, 20)),
output.shape = NULL,
initializer = mxnet:::mx.init.uniform(0.01),
ctx = mx.cpu())
mx.exec.update.arg.arrays(my_executor, new_arg$arg.params, match.name = TRUE)
mx.exec.update.aux.arrays(my_executor, new_arg$aux.params, match.name = TRUE)
my_updater = mx.opt.get.updater(optimizer = my_optimizer, weights = my_executor$ref.arg.arrays)
– 「my_iter$iter.next()」 keeps returning TRUE until the epoch has finished, and at the start of each epoch 「my_iter$reset()」 is called to begin again:
my_iter$reset()
while (my_iter$iter.next()) {
my_values <- my_iter$value()
mx.exec.update.arg.arrays(my_executor, arg.arrays = my_values, match.name = TRUE)
mx.exec.forward(my_executor, is.train = TRUE)
mx.exec.backward(my_executor)
update_args = my_updater(weight = my_executor$ref.arg.arrays, grad = my_executor$ref.grad.arrays)
mx.exec.update.arg.arrays(my_executor, update_args, skip.null = TRUE)
}
set.seed(0)
for (i in 1:100) {
my_iter$reset()
batch_loss = NULL
while (my_iter$iter.next()) {
my_values <- my_iter$value()
mx.exec.update.arg.arrays(my_executor, arg.arrays = my_values, match.name = TRUE)
mx.exec.forward(my_executor, is.train = TRUE)
mx.exec.backward(my_executor)
update_args = my_updater(weight = my_executor$ref.arg.arrays, grad = my_executor$ref.grad.arrays)
mx.exec.update.arg.arrays(my_executor, update_args, skip.null = TRUE)
batch_loss = c(batch_loss, as.array(my_executor$ref.outputs$ce_loss_output))
}
if (i %% 10 == 0 | i < 10) {
cat(paste0("epoch = ", i,
": Cross-Entropy (CE) = ", formatC(mean(batch_loss), format = "f", 4),
"; weight = ", paste(formatC(as.array(my_executor$ref.arg.arrays$linear_pred_weight), format = "f", 3), collapse = ", "),
"; bias = ", formatC(as.array(my_executor$ref.arg.arrays$linear_pred_bias), format = "f", 3), "\n"))
}
}
## epoch = 1: Cross-Entropy (CE) = 0.3328; weight = 1.681, 2.546; bias = 0.038
## epoch = 2: Cross-Entropy (CE) = 0.3305; weight = 1.763, 2.657; bias = 0.040
## epoch = 3: Cross-Entropy (CE) = 0.3301; weight = 1.802, 2.710; bias = 0.040
## epoch = 4: Cross-Entropy (CE) = 0.3300; weight = 1.821, 2.736; bias = 0.040
## epoch = 5: Cross-Entropy (CE) = 0.3300; weight = 1.831, 2.750; bias = 0.040
## epoch = 6: Cross-Entropy (CE) = 0.3300; weight = 1.836, 2.757; bias = 0.041
## epoch = 7: Cross-Entropy (CE) = 0.3300; weight = 1.839, 2.761; bias = 0.041
## epoch = 8: Cross-Entropy (CE) = 0.3300; weight = 1.841, 2.763; bias = 0.041
## epoch = 9: Cross-Entropy (CE) = 0.3300; weight = 1.841, 2.764; bias = 0.041
## epoch = 10: Cross-Entropy (CE) = 0.3300; weight = 1.842, 2.765; bias = 0.041
## epoch = 20: Cross-Entropy (CE) = 0.3300; weight = 1.842, 2.765; bias = 0.041
## epoch = 30: Cross-Entropy (CE) = 0.3300; weight = 1.842, 2.765; bias = 0.041
## epoch = 40: Cross-Entropy (CE) = 0.3300; weight = 1.842, 2.765; bias = 0.041
## epoch = 50: Cross-Entropy (CE) = 0.3300; weight = 1.842, 2.765; bias = 0.041
## epoch = 60: Cross-Entropy (CE) = 0.3300; weight = 1.842, 2.765; bias = 0.041
## epoch = 70: Cross-Entropy (CE) = 0.3300; weight = 1.842, 2.765; bias = 0.041
## epoch = 80: Cross-Entropy (CE) = 0.3300; weight = 1.842, 2.765; bias = 0.041
## epoch = 90: Cross-Entropy (CE) = 0.3300; weight = 1.842, 2.765; bias = 0.041
## epoch = 100: Cross-Entropy (CE) = 0.3300; weight = 1.842, 2.765; bias = 0.041
logistic_model <- mxnet:::mx.model.extract.model(symbol = logistic_pred,
train.execs = list(my_executor))
mx.model.save(logistic_model, "logistic_regression", iteration = 0)
logistic_model = mx.model.load("logistic_regression", iteration = 0)
predict_Y = predict(logistic_model, X.array, array.layout = "colmajor")
confusion_table = table(predict_Y > 0.5, Y)
print(confusion_table)
## Y
## 0 1
## FALSE 445 76
## TRUE 74 405
– Since MxNet's low-level executor can run the Forward/Backward procedure, it can of course also run only the Forward procedure:
all_layers = logistic_model$symbol$get.internals()
linear_pred_output = all_layers$get.output(which(all_layers$outputs == 'linear_pred_output'))
symbol_group = mx.symbol.Group(c(linear_pred_output, logistic_model$symbol))
Pred_executor = mx.simple.bind(symbol = symbol_group, data = c(2, 1000), ctx = mx.cpu())
– The executor is now written; the remaining steps are almost the same as before, with only a few small differences:
mx.exec.update.arg.arrays(Pred_executor, logistic_model$arg.params, match.name = TRUE)
mx.exec.update.aux.arrays(Pred_executor, logistic_model$aux.params, match.name = TRUE)
mx.exec.update.arg.arrays(Pred_executor, list(data = mx.nd.array(X.array)), match.name = TRUE)
mx.exec.forward(Pred_executor, is.train = FALSE)
PRED_linear_pred = as.array(Pred_executor$ref.outputs$linear_pred_output)
PRED_logistic_pred = as.array(Pred_executor$ref.outputs$logistic_pred_output)
confusion_table = table(PRED_logistic_pred > 0.5, Y)
print(confusion_table)
## Y
## 0 1
## FALSE 445 76
## TRUE 74 405
data(iris)
X.array = array(t(as.matrix(iris[,-5])), dim = c(4, 150))
Y.array = array(t(model.matrix(~ -1 + iris[,5])), dim = c(3, 150))
set.seed(0)
TRAIN.seq = sample(1:150, 100)
TRAIN.X.array = X.array[,TRAIN.seq]
TRAIN.Y.array = Y.array[,TRAIN.seq]
TEST.X.array = X.array[,-TRAIN.seq]
TEST.Y.array = Y.array[,-TRAIN.seq]
\[ \begin{align} mlogloss(y, p) & = - \sum \limits_{j=1}^{m} y_{j} \cdot log(p_{j}) \end{align} \]
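– A minimal base-R sketch of this loss (m_logloss_R is a name introduced here for illustration), where label and prob are 3 x n arrays with the classes in the rows. Note that the m_logloss symbol built below with mx.symbol.mean averages over all \(m \times n\) elements instead of summing over the classes first, which only rescales this value by \(1/m\) and does not change the optimum.
m_logloss_R = function(label, prob, eps = 1e-8) {
  # per-sample sum over the m classes, then averaged over the n samples in the batch
  -mean(colSums(label * log(prob + eps)))
}
# example: a perfect prediction on the first training sample gives a loss close to 0
m_logloss_R(TRAIN.Y.array[, 1, drop = FALSE], TRAIN.Y.array[, 1, drop = FALSE])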
my_iterator_core = function(batch_size) {
batch = 0
batch_per_epoch = ncol(TRAIN.Y.array)/batch_size
reset = function() {batch <<- 0}
iter.next = function() {
batch <<- batch+1
if (batch > batch_per_epoch) {return(FALSE)} else {return(TRUE)}
}
value = function() {
idx = 1:batch_size + (batch - 1) * batch_size
idx[idx > ncol(TRAIN.Y.array)] = sample(1:ncol(TRAIN.Y.array), sum(idx > ncol(TRAIN.Y.array)))
data = mx.nd.array(TRAIN.X.array[,idx, drop=FALSE])
label = mx.nd.array(TRAIN.Y.array[,idx, drop=FALSE])
return(list(data = data, label = label))
}
return(list(reset = reset, iter.next = iter.next, value = value, batch_size = batch_size, batch = batch))
}
my_iterator_func <- setRefClass("Custom_Iter",
fields = c("iter", "batch_size"),
contains = "Rcpp_MXArrayDataIter",
methods = list(
initialize = function(iter, batch_size = 100){
.self$iter <- my_iterator_core(batch_size = batch_size)
.self
},
value = function(){
.self$iter$value()
},
iter.next = function(){
.self$iter$iter.next()
},
reset = function(){
.self$iter$reset()
},
finalize=function(){
}
)
)
my_iter = my_iterator_func(iter = NULL, batch_size = 20)
data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')
linear_pred = mx.symbol.FullyConnected(data = data, num.hidden = 3, name = 'linear_pred')
softmax_layer = mx.symbol.softmax(data = linear_pred, axis = 1, name = 'softmax_layer')
eps = 1e-8
m_log = 0 - mx.symbol.mean(mx.symbol.broadcast_mul(mx.symbol.log(softmax_layer + eps), label))
m_logloss = mx.symbol.MakeLoss(m_log, name = 'm_logloss')
my_optimizer = mx.opt.create(name = "sgd", learning.rate = 0.05, momentum = 0.9, wd = 0)
#1. Build an executor to train model
my_executor = mx.simple.bind(symbol = m_logloss,
data = c(4, 20), label = c(3, 20),
ctx = mx.cpu(), grad.req = "write")
#2. Set the initial parameters
mx.set.seed(0)
new_arg = mxnet:::mx.model.init.params(symbol = m_logloss,
input.shape = list(data = c(4, 20), label = c(3, 20)),
output.shape = NULL,
initializer = mxnet:::mx.init.uniform(0.01),
ctx = mx.cpu())
mx.exec.update.arg.arrays(my_executor, new_arg$arg.params, match.name = TRUE)
mx.exec.update.aux.arrays(my_executor, new_arg$aux.params, match.name = TRUE)
#3. Define the updater
my_updater = mx.opt.get.updater(optimizer = my_optimizer, weights = my_executor$ref.arg.arrays)
set.seed(0)
for (i in 1:200) {
my_iter$reset()
batch_loss = NULL
while (my_iter$iter.next()) {
my_values <- my_iter$value()
mx.exec.update.arg.arrays(my_executor, arg.arrays = my_values, match.name = TRUE)
mx.exec.forward(my_executor, is.train = TRUE)
mx.exec.backward(my_executor)
update_args = my_updater(weight = my_executor$ref.arg.arrays, grad = my_executor$ref.grad.arrays)
mx.exec.update.arg.arrays(my_executor, update_args, skip.null = TRUE)
batch_loss = c(batch_loss, as.array(my_executor$ref.outputs$m_logloss_output))
}
if (i %% 10 == 0 | i <= 5) {
cat(paste0("epoch = ", i, ": m-logloss = ", formatC(mean(batch_loss), format = "f", 4), "\n"))
}
}
## epoch = 1: m-logloss = 0.3504
## epoch = 2: m-logloss = 0.2999
## epoch = 3: m-logloss = 0.2367
## epoch = 4: m-logloss = 0.1970
## epoch = 5: m-logloss = 0.1736
## epoch = 10: m-logloss = 0.1293
## epoch = 20: m-logloss = 0.0983
## epoch = 30: m-logloss = 0.0822
## epoch = 40: m-logloss = 0.0719
## epoch = 50: m-logloss = 0.0648
## epoch = 60: m-logloss = 0.0595
## epoch = 70: m-logloss = 0.0554
## epoch = 80: m-logloss = 0.0521
## epoch = 90: m-logloss = 0.0494
## epoch = 100: m-logloss = 0.0471
## epoch = 110: m-logloss = 0.0452
## epoch = 120: m-logloss = 0.0436
## epoch = 130: m-logloss = 0.0421
## epoch = 140: m-logloss = 0.0408
## epoch = 150: m-logloss = 0.0397
## epoch = 160: m-logloss = 0.0387
## epoch = 170: m-logloss = 0.0378
## epoch = 180: m-logloss = 0.0369
## epoch = 190: m-logloss = 0.0362
## epoch = 200: m-logloss = 0.0355
softmax_model <- mxnet:::mx.model.extract.model(symbol = softmax_layer,
train.execs = list(my_executor))
predict_Y = predict(softmax_model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##
## 1 2 3
## 1 18 0 0
## 2 0 13 0
## 3 0 2 17
– Let's try turning the Softmax regression from Exercise 2 into a multilayer perceptron with one hidden layer. In fact, only the Model architecture part needs to be modified; nothing else has to change:
data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')
fc1 = mx.symbol.FullyConnected(data = data, num.hidden = 10, name = 'fc1')
relu1 = mx.symbol.Activation(data = fc1, act.type = 'relu', name = 'relu1')
fc2 = mx.symbol.FullyConnected(data = relu1, num.hidden = 3, name = 'fc2')
softmax_layer = mx.symbol.softmax(data = fc2, axis = 1, name = 'softmax_layer')
eps = 1e-8
m_log = 0 - mx.symbol.mean(mx.symbol.broadcast_mul(mx.symbol.log(softmax_layer + eps), label))
m_logloss = mx.symbol.MakeLoss(m_log, name = 'm_logloss')
#1. Build an executor to train model
my_executor = mx.simple.bind(symbol = m_logloss,
data = c(4, 20), label = c(3, 20),
ctx = mx.cpu(), grad.req = "write")
#2. Set the initial parameters
mx.set.seed(0)
new_arg = mxnet:::mx.model.init.params(symbol = m_logloss,
input.shape = list(data = c(4, 20), label = c(3, 20)),
output.shape = NULL,
initializer = mxnet:::mx.init.uniform(0.01),
ctx = mx.cpu())
mx.exec.update.arg.arrays(my_executor, new_arg$arg.params, match.name = TRUE)
mx.exec.update.aux.arrays(my_executor, new_arg$aux.params, match.name = TRUE)
#3. Define the updater
my_updater = mx.opt.get.updater(optimizer = my_optimizer, weights = my_executor$ref.arg.arrays)
set.seed(0)
for (i in 1:200) {
my_iter$reset()
batch_loss = NULL
while (my_iter$iter.next()) {
my_values <- my_iter$value()
mx.exec.update.arg.arrays(my_executor, arg.arrays = my_values, match.name = TRUE)
mx.exec.forward(my_executor, is.train = TRUE)
mx.exec.backward(my_executor)
update_args = my_updater(weight = my_executor$ref.arg.arrays, grad = my_executor$ref.grad.arrays)
mx.exec.update.arg.arrays(my_executor, update_args, skip.null = TRUE)
batch_loss = c(batch_loss, as.array(my_executor$ref.outputs$m_logloss_output))
}
if (i %% 10 == 0 | i <= 5) {
message(paste0("epoch = ", i, ": m-logloss = ", formatC(mean(batch_loss), format = "f", 4)))
}
}
softmax_model <- mxnet:::mx.model.extract.model(symbol = softmax_layer,
train.execs = list(my_executor))
predict_Y = predict(softmax_model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##
## 1 2 3
## 1 18 0 0
## 2 0 13 0
## 3 0 2 17
my.model.FeedForward.create = function (Iterator,
loss_symbol, pred_symbol,
Optimizer, num_round = 200) {
#0. Check data shape
Iterator$reset()
Iterator$iter.next()
my_values <- Iterator$value()
input_shape <- lapply(my_values, dim)
#1. Build an executor to train model
exec_list = list(symbol = loss_symbol, ctx = mx.cpu(), grad.req = "write")
exec_list = append(exec_list, input_shape)
my_executor = do.call(mx.simple.bind, exec_list)
#2. Set the initial parameters
mx.set.seed(0)
new_arg = mxnet:::mx.model.init.params(symbol = loss_symbol,
input.shape = input_shape,
output.shape = NULL,
initializer = mxnet:::mx.init.uniform(0.01),
ctx = mx.cpu())
mx.exec.update.arg.arrays(my_executor, new_arg$arg.params, match.name = TRUE)
mx.exec.update.aux.arrays(my_executor, new_arg$aux.params, match.name = TRUE)
#3. Define the updater
my_updater = mx.opt.get.updater(optimizer = Optimizer, weights = my_executor$ref.arg.arrays)
#4. Forward/Backward
set.seed(0)
for (i in 1:num_round) {
Iterator$reset()
batch_loss = NULL
while (Iterator$iter.next()) {
my_values <- Iterator$value()
mx.exec.update.arg.arrays(my_executor, arg.arrays = my_values, match.name = TRUE)
mx.exec.forward(my_executor, is.train = TRUE)
mx.exec.backward(my_executor)
update_args = my_updater(weight = my_executor$ref.arg.arrays, grad = my_executor$ref.grad.arrays)
mx.exec.update.arg.arrays(my_executor, update_args, skip.null = TRUE)
batch_loss = c(batch_loss, as.array(my_executor$ref.outputs[[1]]))
}
if (i %% 10 == 0 | i <= 5) {
message(paste0("epoch = ", i, ": loss = ", formatC(mean(batch_loss), format = "f", 4)))
}
}
#5. Get model
my_model <- mxnet:::mx.model.extract.model(symbol = pred_symbol,
train.execs = list(my_executor))
return(my_model)
}
model = my.model.FeedForward.create(Iterator = my_iter,
loss_symbol = m_logloss, pred_symbol = softmax_layer,
Optimizer = my_optimizer, num_round = 200)
predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##
## 1 2 3
## 1 18 0 0
## 2 0 13 0
## 3 0 2 17
data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')
fc1 = mx.symbol.FullyConnected(data = data, num.hidden = 10, name = 'fc1')
dp1 = mx.symbol.Dropout(data = fc1, p = 0.2, name = 'dp1')
relu1 = mx.symbol.Activation(data = dp1, act.type = 'relu', name = 'relu1')
fc2 = mx.symbol.FullyConnected(data = relu1, num.hidden = 3, name = 'fc3')
softmax_layer = mx.symbol.softmax(data = fc2, axis = 1, name = 'softmax_layer')
eps = 1e-8
m_log = 0 - mx.symbol.mean(mx.symbol.broadcast_mul(mx.symbol.log(softmax_layer + eps), label))
m_logloss = mx.symbol.MakeLoss(m_log, name = 'm_logloss')
model = my.model.FeedForward.create(Iterator = my_iter,
loss_symbol = m_logloss, pred_symbol = softmax_layer,
Optimizer = my_optimizer, num_round = 200)
predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##
## 1 2 3
## 1 18 0 0
## 2 0 13 0
## 3 0 2 17
– Are you starting to feel that MxNet speeds up your own AI development?
data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')
fc1 = mx.symbol.FullyConnected(data = data, num.hidden = 10, name = 'fc1')
relu1 = mx.symbol.Activation(data = fc1, act.type = 'relu', name = 'relu1')
fc2 = mx.symbol.FullyConnected(data = relu1, num.hidden = 10, name = 'fc2')
relu2 = mx.symbol.Activation(data = fc2, act.type = 'relu', name = 'relu2')
fc3 = mx.symbol.FullyConnected(data = relu2, num.hidden = 10, name = 'fc3')
relu3 = mx.symbol.Activation(data = fc3, act.type = 'relu', name = 'relu3')
fc4 = mx.symbol.FullyConnected(data = relu3, num.hidden = 3, name = 'fc4')
softmax_layer = mx.symbol.softmax(data = fc4, axis = 1, name = 'softmax_layer')
eps = 1e-8
m_log = 0 - mx.symbol.mean(mx.symbol.broadcast_mul(mx.symbol.log(softmax_layer + eps), label))
m_logloss = mx.symbol.MakeLoss(m_log, name = 'm_logloss')
model = my.model.FeedForward.create(Iterator = my_iter,
loss_symbol = m_logloss, pred_symbol = softmax_layer,
Optimizer = my_optimizer, num_round = 200)
predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##
## 1 2 3
## 2 18 15 17
my_optimizer = mx.opt.create(name = "adam", learning.rate = 0.001, beta1 = 0.9, beta2 = 0.999,
epsilon = 1e-08, wd = 0)
model = my.model.FeedForward.create(Iterator = my_iter,
loss_symbol = m_logloss, pred_symbol = softmax_layer,
Optimizer = my_optimizer, num_round = 200)
predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##
## 1 2 3
## 1 18 0 0
## 2 0 12 0
## 3 0 3 17
my.model.FeedForward.create = function (Iterator,
loss_symbol, pred_symbol,
Optimizer, num_round = 200) {
require(abind)
#0. Check data shape
Iterator$reset()
Iterator$iter.next()
my_values <- Iterator$value()
input_shape <- lapply(my_values, dim)
batch_size <- tail(input_shape[[1]], 1)
#1. Build an executor to train model
exec_list = list(symbol = loss_symbol, ctx = mx.cpu(), grad.req = "write")
exec_list = append(exec_list, input_shape)
my_executor = do.call(mx.simple.bind, exec_list)
#2. Set the initial parameters
mx.set.seed(0)
new_arg = mxnet:::mx.model.init.params(symbol = loss_symbol,
input.shape = input_shape,
output.shape = NULL,
initializer = mxnet:::mx.init.uniform(0.01),
ctx = mx.cpu())
mx.exec.update.arg.arrays(my_executor, new_arg$arg.params, match.name = TRUE)
mx.exec.update.aux.arrays(my_executor, new_arg$aux.params, match.name = TRUE)
#3. Define the updater
my_updater = mx.opt.get.updater(optimizer = Optimizer, weights = my_executor$ref.arg.arrays)
#4. Forward/Backward
message('Start training:')
set.seed(0)
epoch_grad = NULL
for (i in 1:num_round) {
Iterator$reset()
batch_loss = list()
batch_grad = list()
batch_seq = 0
t0 = Sys.time()
while (Iterator$iter.next()) {
my_values <- Iterator$value()
mx.exec.update.arg.arrays(my_executor, arg.arrays = my_values, match.name = TRUE)
mx.exec.forward(my_executor, is.train = TRUE)
mx.exec.backward(my_executor)
update_args = my_updater(weight = my_executor$ref.arg.arrays, grad = my_executor$ref.grad.arrays)
mx.exec.update.arg.arrays(my_executor, update_args, skip.null = TRUE)
batch_loss[[length(batch_loss) + 1]] = as.array(my_executor$ref.outputs[[1]])
grad_list = sapply(my_executor$ref.grad.arrays, function (x) {if (!is.null(x)) {mean(abs(as.array(x)))}})
grad_list = unlist(grad_list[grepl('weight', names(grad_list), fixed = TRUE)])
batch_grad[[length(batch_grad) + 1]] = grad_list
batch_seq = batch_seq + 1
}
if (i %% 10 == 0 | i <= 5) {
message(paste0("epoch = ", i,
": loss = ", formatC(mean(unlist(batch_loss)), format = "f", 4),
" (Speed: ", formatC(batch_seq * batch_size/as.numeric(Sys.time() - t0, units = 'secs'), format = "f", 2), " sample/secs)"))
}
epoch_grad = rbind(epoch_grad, apply(abind(batch_grad, along = 2), 1, mean))
}
epoch_grad[epoch_grad < 1e-8] = 1e-8
COL = rainbow(ncol(epoch_grad))
random_pos = 2^runif(ncol(epoch_grad), -0.5, 0.5)
plot(epoch_grad[,1] * random_pos[1], type = 'l', col = COL[1],
xlab = 'epoch', ylab = 'mean of abs(grad)', log = 'y',
ylim = range(epoch_grad))
for (i in 2:ncol(epoch_grad)) {lines(1:nrow(epoch_grad), epoch_grad[,i] * random_pos[i], col = COL[i])}
legend('bottomright', paste0('fc', 1:ncol(epoch_grad), '_weight'), col = COL, lwd = 1)
#5. Get model
my_model <- mxnet:::mx.model.extract.model(symbol = pred_symbol,
train.execs = list(my_executor))
return(my_model)
}
my_optimizer = mx.opt.create(name = "sgd", learning.rate = 0.05, momentum = 0.9, wd = 0)
model = my.model.FeedForward.create(Iterator = my_iter,
loss_symbol = m_logloss, pred_symbol = softmax_layer,
Optimizer = my_optimizer, num_round = 200)
predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##
## 1 2 3
## 2 18 15 17
my_optimizer = mx.opt.create(name = "adam", learning.rate = 0.001, beta1 = 0.9, beta2 = 0.999,
epsilon = 1e-08, wd = 0)
model = my.model.FeedForward.create(Iterator = my_iter,
loss_symbol = m_logloss, pred_symbol = softmax_layer,
Optimizer = my_optimizer, num_round = 200)
predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##
## 1 2 3
## 1 18 0 0
## 2 0 12 0
## 3 0 3 17
data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')
fc1 = mx.symbol.FullyConnected(data = data, num.hidden = 10, name = 'fc1')
relu1 = mx.symbol.Activation(data = fc1, act.type = 'relu', name = 'relu1')
fc2 = mx.symbol.FullyConnected(data = relu1, num.hidden = 10, name = 'fc2')
relu2 = mx.symbol.Activation(data = fc2, act.type = 'relu', name = 'relu2')
fc3 = mx.symbol.FullyConnected(data = relu2, num.hidden = 10, name = 'fc3')
relu3 = mx.symbol.Activation(data = fc3, act.type = 'relu', name = 'relu3')
fc4 = mx.symbol.FullyConnected(data = relu3, num.hidden = 10, name = 'fc4')
relu4 = mx.symbol.Activation(data = fc4, act.type = 'relu', name = 'relu4')
fc5 = mx.symbol.FullyConnected(data = relu4, num.hidden = 10, name = 'fc5')
relu5 = mx.symbol.Activation(data = fc5, act.type = 'relu', name = 'relu5')
fc6 = mx.symbol.FullyConnected(data = relu5, num.hidden = 10, name = 'fc6')
relu6 = mx.symbol.Activation(data = fc6, act.type = 'relu', name = 'relu6')
fc7 = mx.symbol.FullyConnected(data = relu6, num.hidden = 10, name = 'fc7')
relu7 = mx.symbol.Activation(data = fc7, act.type = 'relu', name = 'relu7')
fc8 = mx.symbol.FullyConnected(data = relu7, num.hidden = 3, name = 'fc8')
softmax_layer = mx.symbol.softmax(data = fc8, axis = 1, name = 'softmax_layer')
eps = 1e-8
m_log = 0 - mx.symbol.mean(mx.symbol.broadcast_mul(mx.symbol.log(softmax_layer + eps), label))
m_logloss = mx.symbol.MakeLoss(m_log, name = 'm_logloss')
my_optimizer = mx.opt.create(name = "adam", learning.rate = 0.001, beta1 = 0.9, beta2 = 0.999,
epsilon = 1e-08, wd = 0)
model = my.model.FeedForward.create(Iterator = my_iter,
loss_symbol = m_logloss, pred_symbol = softmax_layer,
Optimizer = my_optimizer, num_round = 200)
predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##
## 1 2 3
## 2 18 15 17
– Once the gradients have vanished completely, no amount of parameter tuning will ever solve the problem.
– In theory, during backpropagation the gradient becomes smaller and smaller the farther we get from the output layer. Let's look again at the gradient formulas of the multilayer perceptron:
\[ \begin{align} grad.o & = \frac{\partial}{\partial o}loss = \frac{o-y}{o(1-o)} \\ grad.l_2 & = \frac{\partial}{\partial l_2}loss = grad.o \otimes \frac{\partial}{\partial l_2}o= o-y \\ grad.W^2_1 & = \frac{\partial}{\partial W^2_1}loss = grad.l_2 \otimes \frac{\partial}{\partial W^2_1}l_2 = \frac{{1}}{n} \otimes (h_1^E)^T \bullet grad.l_2\\ grad.h_1^E & = \frac{\partial}{\partial h_1^E}loss = grad.l_2 \otimes \frac{\partial}{\partial h_1^E}l_2 = grad.l_2 \bullet (W^2_1)^T \\ grad.l_1 & = \frac{\partial}{\partial l_1}loss = grad.h_1 \otimes \frac{\partial}{\partial l_1}h_1 = grad.h_1 \otimes \frac{\partial}{\partial l_1}ReLU(l_1) \\ grad.W^1_d & = \frac{\partial}{\partial W^1_d}loss = grad.l_1 \otimes \frac{\partial}{\partial W^1_d}l_1 = \frac{{1}}{n} \otimes (x^E)^T \bullet grad.l_1 \end{align} \]
– Think about it: the weights we randomly initialize are mostly close to 0, so if the input values vary on wildly different scales, the gradients will swing accordingly.
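– A rough base-R illustration of this point (W, x_raw and x_big are names introduced here; the weight scale mimics the mx.init.uniform(0.01) initializer used above): rescaling two of the features by 1000, as we do below, puts the linear outputs, and hence the first-layer weight gradients that are proportional to the inputs, on a completely different scale.
set.seed(0)
W = matrix(runif(4 * 10, -0.01, 0.01), nrow = 4)    # freshly initialized weights near 0
x_raw = rnorm(4)                                     # a standardized sample
x_big = x_raw * c(1, 1, 1000, 1000)                  # the same rescaling as applied below
c(sd_raw = sd(as.vector(t(x_raw) %*% W)), sd_big = sd(as.vector(t(x_big) %*% W)))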
TRAIN.X.array[3:4,] = TRAIN.X.array[3:4,] * 1000
TEST.X.array[3:4,] = TEST.X.array[3:4,] * 1000
data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')
fc1 = mx.symbol.FullyConnected(data = data, num.hidden = 10, name = 'fc1')
dp1 = mx.symbol.Dropout(data = fc1, p = 0.2, name = 'dp1')
relu1 = mx.symbol.Activation(data = dp1, act.type = 'relu', name = 'relu1')
fc2 = mx.symbol.FullyConnected(data = relu1, num.hidden = 3, name = 'fc3')
softmax_layer = mx.symbol.softmax(data = fc2, axis = 1, name = 'softmax_layer')
eps = 1e-8
m_log = 0 - mx.symbol.mean(mx.symbol.broadcast_mul(mx.symbol.log(softmax_layer + eps), label))
m_logloss = mx.symbol.MakeLoss(m_log, name = 'm_logloss')
my_optimizer = mx.opt.create(name = "adam", learning.rate = 0.001, beta1 = 0.9, beta2 = 0.999,
epsilon = 1e-08, wd = 0)
model = my.model.FeedForward.create(Iterator = my_iter,
loss_symbol = m_logloss, pred_symbol = softmax_layer,
Optimizer = my_optimizer, num_round = 200)
predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##
## 1 2 3
## 1 18 0 0
## 2 0 13 4
## 3 0 2 13
my_optimizer = mx.opt.create(name = "sgd", learning.rate = 0.05, momentum = 0.9, wd = 0)
model = my.model.FeedForward.create(Iterator = my_iter,
loss_symbol = m_logloss, pred_symbol = softmax_layer,
Optimizer = my_optimizer, num_round = 200)
predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##
## 1 2 3
## 2 18 15 17
data(iris)
X.array = array(t(as.matrix(iris[,-5])), dim = c(4, 150))
for (i in 1:4) {X.array[i,] = (X.array[i,] - mean(X.array[i,]))/sd(X.array[i,])}
Y.array = array(t(model.matrix(~ -1 + iris[,5])), dim = c(3, 150))
set.seed(0)
TRAIN.seq = sample(1:150, 100)
TRAIN.X.array = X.array[,TRAIN.seq]
TRAIN.Y.array = Y.array[,TRAIN.seq]
TEST.X.array = X.array[,-TRAIN.seq]
TEST.Y.array = Y.array[,-TRAIN.seq]
my_optimizer = mx.opt.create(name = "sgd", learning.rate = 0.05, momentum = 0.9, wd = 0)
model = my.model.FeedForward.create(Iterator = my_iter,
loss_symbol = m_logloss, pred_symbol = softmax_layer,
Optimizer = my_optimizer, num_round = 200)
predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##
## 1 2 3
## 1 18 0 0
## 2 0 13 0
## 3 0 2 17
my_optimizer = mx.opt.create(name = "adam", learning.rate = 0.001, beta1 = 0.9, beta2 = 0.999,
epsilon = 1e-08, wd = 0)
model = my.model.FeedForward.create(Iterator = my_iter,
loss_symbol = m_logloss, pred_symbol = softmax_layer,
Optimizer = my_optimizer, num_round = 200)
predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##
## 1 2 3
## 1 18 0 0
## 2 0 14 0
## 3 0 1 17
– This technique is called 「Batch normalization」. Two Google researchers, Sergey Ioffe and Christian Szegedy, first proposed the idea in their 2015 paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
\[ \begin{align} \hat{x_i} & = \frac{x_i - \bar{x}}{\sqrt{\sigma^2_{x} + \epsilon}} \\ y_i = BatchNorm(x_i) & = \hat{x_i} \times \gamma \ + \beta \\\\ \bar{x} & = \frac{1}{n} \sum\limits_{i=1}^{n} x_i \\ \sigma^2_{x} & = \frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \bar{x})^2 \end{align} \]
– Here \(\epsilon\) is a very small number (to avoid division by zero), \(\bar{x}\) and \(\sigma^2_{x}\) are the mean and variance of \(x\), and \(\gamma\) and \(\beta\) are two linear transformation terms that make batch normalization a reversible process (just set \(\gamma = \sqrt{\sigma^2_{x} + \epsilon}\) and \(\beta = \bar{x}\); a quick numerical check follows the demo below).
data(iris)
demo_X.array = array(t(as.matrix(iris[,-5])), dim = c(4, 150))
bn_X.array = demo_X.array
for (i in 1:4) {
bn_mean = mean(demo_X.array[i,])
bn_var = var(demo_X.array[i,])
eps = 1e-3
gamma = 1
beta = 0
bn_X.array[i,] = (demo_X.array[i,] - bn_mean) / sqrt(bn_var + eps) * gamma + beta
}
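– A quick check of the reversibility claim made above (illustrative only): choosing \(\gamma = \sqrt{\sigma^2_{x} + \epsilon}\) and \(\beta = \bar{x}\) recovers the original inputs.
i = 1
bn_mean = mean(demo_X.array[i,])
bn_var = var(demo_X.array[i,])
eps = 1e-3
recovered_X = (demo_X.array[i,] - bn_mean) / sqrt(bn_var + eps) * sqrt(bn_var + eps) + bn_mean
all.equal(recovered_X, demo_X.array[i,])  # TRUE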
– Suppose that when backpropagation reaches \(BatchNorm\) a \(grad.y\) already exists; starting from it we derive downward (derivation omitted):
\[\begin{align} \frac{\partial y}{\partial \beta} & = \frac{1}{n} \sum\limits_{i=1}^{n} grad.y_i \\ \frac{\partial y}{\partial \gamma} & = \frac{1}{n} \sum\limits_{i=1}^{n} grad.y_i \times \hat{x_i} \\\\ \frac{\partial y}{\partial \hat{x}} & = grad.y \otimes \gamma \\ \frac{\partial y}{\partial \sigma^2_{x}} & = - \frac{1} {2} \sum\limits_{i=1}^{n} \gamma (x_i - \bar{x}) (\sigma^2_{x} + \epsilon)^{-1.5} grad.y_i \\ \frac{\partial y}{\partial \bar{x}} & = \sum\limits_{i=1}^{n} \frac {- grad.y_i \times \gamma} {\sqrt{\sigma^2_{x} + \epsilon}} + \frac{\partial y}{\partial \sigma^2_{x}} \times \frac {-2 \sum\limits_{i=1}^{n} (x_i - \bar{x}) } {n} \\\\ \frac{\partial y}{\partial x} & = \frac{\partial y}{\partial \hat{x}} \otimes \frac {1} {\sqrt{\sigma^2_{x} + \epsilon}} \oplus \frac{\partial y}{\partial \sigma^2_{x}} \otimes \frac {2(x_i - \bar{x})} {n} \oplus \frac{\partial y}{\partial \bar{x}} \otimes \frac {1} {n}\end{align}\]
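– These formulas can be checked numerically with a few lines of base R (a sketch with names introduced here; one feature, a batch of n values, and a random upstream gradient grad_y standing in for the rest of the network):
set.seed(1)
n = 8
x = rnorm(n); gamma = 1.5; beta = -0.3; eps = 1e-3
grad_y = rnorm(n)                                        # pretend gradient arriving from above
x_mean = mean(x); x_var = mean((x - x_mean)^2)           # population variance, as in the definition
x_hat = (x - x_mean) / sqrt(x_var + eps)
# analytic backward pass following the formulas above
grad_x_hat = grad_y * gamma
grad_var = -0.5 * sum(gamma * (x - x_mean) * (x_var + eps)^(-1.5) * grad_y)
grad_mean = sum(-grad_y * gamma / sqrt(x_var + eps)) + grad_var * (-2) * sum(x - x_mean) / n
grad_x = grad_x_hat / sqrt(x_var + eps) + grad_var * 2 * (x - x_mean) / n + grad_mean / n
# numerical check of the first element of grad.x
f = function(x_new) {
  m = mean(x_new); v = mean((x_new - m)^2)
  sum(((x_new - m) / sqrt(v + eps) * gamma + beta) * grad_y)
}
delta = 1e-6
x_perturbed = x; x_perturbed[1] = x_perturbed[1] + delta
c(analytic = grad_x[1], numerical = (f(x_perturbed) - f(x)) / delta)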
– With MxNet's help, implementing batch normalization is just as easy as implementing Dropout!
data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')
fc1 = mx.symbol.FullyConnected(data = data, num.hidden = 10, name = 'fc1')
bn1 = mx.symbol.BatchNorm(data = fc1, axis = 1, eps = 1e-3, fix.gamma = TRUE, name = 'bn1')
relu1 = mx.symbol.Activation(data = bn1, act.type = 'relu', name = 'relu1')
fc2 = mx.symbol.FullyConnected(data = relu1, num.hidden = 10, name = 'fc2')
bn2 = mx.symbol.BatchNorm(data = fc2, axis = 1, eps = 1e-3, fix.gamma = TRUE, name = 'bn2')
relu2 = mx.symbol.Activation(data = bn2, act.type = 'relu', name = 'relu2')
fc3 = mx.symbol.FullyConnected(data = relu2, num.hidden = 10, name = 'fc3')
bn3 = mx.symbol.BatchNorm(data = fc3, axis = 1, eps = 1e-3, fix.gamma = TRUE, name = 'bn3')
relu3 = mx.symbol.Activation(data = bn3, act.type = 'relu', name = 'relu3')
fc4 = mx.symbol.FullyConnected(data = relu3, num.hidden = 3, name = 'fc4')
softmax_layer = mx.symbol.softmax(data = fc4, axis = 1, name = 'softmax_layer')
eps = 1e-8
m_log = 0 - mx.symbol.mean(mx.symbol.broadcast_mul(mx.symbol.log(softmax_layer + eps), label))
m_logloss = mx.symbol.MakeLoss(m_log, name = 'm_logloss')
my_optimizer = mx.opt.create(name = "sgd", learning.rate = 0.05, momentum = 0.9, wd = 0)
model = my.model.FeedForward.create(Iterator = my_iter,
loss_symbol = m_logloss, pred_symbol = softmax_layer,
Optimizer = my_optimizer, num_round = 200)
predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##
## 1 2 3
## 1 18 0 0
## 2 0 14 1
## 3 0 1 16
PARAMS = model$arg.params
MEANS = model$aux.params
ls(PARAMS)
## [1] "bn1_beta" "bn1_gamma" "bn2_beta" "bn2_gamma" "bn3_beta"
## [6] "bn3_gamma" "fc1_bias" "fc1_weight" "fc2_bias" "fc2_weight"
## [11] "fc3_bias" "fc3_weight" "fc4_bias" "fc4_weight"
ls(MEANS)
## [1] "bn1_moving_mean" "bn1_moving_var" "bn2_moving_mean" "bn2_moving_var"
## [5] "bn3_moving_mean" "bn3_moving_var"
Input = TEST.X.array[,1]
dim(Input) = c(4, 1)
preds = predict(model, Input, array.layout = "colmajor")
print(preds)
## [,1]
## [1,] 9.999359e-01
## [2,] 2.815460e-05
## [3,] 3.597806e-05
PARAMS = model$arg.params
MEANS = model$aux.params
Input = TEST.X.array[,1]
dim(Input) = c(4, 1)
bn_eps = 1e-3
fc1_out = t(Input) %*% as.array(PARAMS$fc1_weight) + as.array(PARAMS$fc1_bias)
bn1_out = (fc1_out - as.array(MEANS$bn1_moving_mean)) / sqrt(as.array(MEANS$bn1_moving_var) + bn_eps) * as.array(PARAMS$bn1_gamma) + as.array(PARAMS$bn1_beta)
relu1_out = bn1_out
relu1_out[relu1_out < 0] = 0
fc2_out = relu1_out %*% as.array(PARAMS$fc2_weight) + as.array(PARAMS$fc2_bias)
bn2_out = (fc2_out - as.array(MEANS$bn2_moving_mean)) / sqrt(as.array(MEANS$bn2_moving_var) + bn_eps) * as.array(PARAMS$bn2_gamma) + as.array(PARAMS$bn2_beta)
relu2_out = bn2_out
relu2_out[relu2_out < 0] = 0
fc3_out = relu2_out %*% as.array(PARAMS$fc3_weight) + as.array(PARAMS$fc3_bias)
bn3_out = (fc3_out - as.array(MEANS$bn3_moving_mean)) / sqrt(as.array(MEANS$bn3_moving_var) + bn_eps) * as.array(PARAMS$bn3_gamma) + as.array(PARAMS$bn3_beta)
relu3_out = bn3_out
relu3_out[relu3_out < 0] = 0
fc4_out = relu3_out %*% as.array(PARAMS$fc4_weight) + as.array(PARAMS$fc4_bias)
Softmax_out = exp(fc4_out)/sum(exp(fc4_out))
cbind(t(Softmax_out), preds)
## [,1] [,2]
## [1,] 9.999359e-01 9.999359e-01
## [2,] 2.815463e-05 2.815460e-05
## [3,] 3.597807e-05 3.597806e-05
data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')
fc1 = mx.symbol.FullyConnected(data = data, num.hidden = 10, name = 'fc1')
bn1 = mx.symbol.BatchNorm(data = fc1, axis = 1, eps = 1e-3, fix.gamma = TRUE, name = 'bn1')
relu1 = mx.symbol.Activation(data = bn1, act.type = 'relu', name = 'relu1')
fc2 = mx.symbol.FullyConnected(data = relu1, num.hidden = 10, name = 'fc2')
bn2 = mx.symbol.BatchNorm(data = fc2, axis = 1, eps = 1e-3, fix.gamma = TRUE, name = 'bn2')
relu2 = mx.symbol.Activation(data = bn2, act.type = 'relu', name = 'relu2')
fc3 = mx.symbol.FullyConnected(data = relu2, num.hidden = 10, name = 'fc3')
bn3 = mx.symbol.BatchNorm(data = fc3, axis = 1, eps = 1e-3, fix.gamma = TRUE, name = 'bn3')
relu3 = mx.symbol.Activation(data = bn3, act.type = 'relu', name = 'relu3')
fc4 = mx.symbol.FullyConnected(data = relu3, num.hidden = 10, name = 'fc4')
bn4 = mx.symbol.BatchNorm(data = fc4, axis = 1, eps = 1e-3, fix.gamma = TRUE, name = 'bn4')
relu4 = mx.symbol.Activation(data = bn4, act.type = 'relu', name = 'relu4')
fc5 = mx.symbol.FullyConnected(data = relu4, num.hidden = 10, name = 'fc5')
bn5 = mx.symbol.BatchNorm(data = fc5, axis = 1, eps = 1e-3, fix.gamma = TRUE, name = 'bn5')
relu5 = mx.symbol.Activation(data = bn5, act.type = 'relu', name = 'relu5')
fc6 = mx.symbol.FullyConnected(data = relu5, num.hidden = 10, name = 'fc6')
bn6 = mx.symbol.BatchNorm(data = fc6, axis = 1, eps = 1e-3, fix.gamma = TRUE, name = 'bn6')
relu6 = mx.symbol.Activation(data = bn6, act.type = 'relu', name = 'relu6')
fc7 = mx.symbol.FullyConnected(data = relu6, num.hidden = 10, name = 'fc7')
bn7 = mx.symbol.BatchNorm(data = fc7, axis = 1, eps = 1e-3, fix.gamma = TRUE, name = 'bn7')
relu7 = mx.symbol.Activation(data = bn7, act.type = 'relu', name = 'relu7')
fc8 = mx.symbol.FullyConnected(data = relu7, num.hidden = 3, name = 'fc8')
softmax_layer = mx.symbol.softmax(data = fc8, axis = 1, name = 'softmax_layer')
eps = 1e-8
m_log = 0 - mx.symbol.mean(mx.symbol.broadcast_mul(mx.symbol.log(softmax_layer + eps), label))
m_logloss = mx.symbol.MakeLoss(m_log, name = 'm_logloss')
my_optimizer = mx.opt.create(name = "sgd", learning.rate = 0.05, momentum = 0.9, wd = 0)
model = my.model.FeedForward.create(Iterator = my_iter,
loss_symbol = m_logloss, pred_symbol = softmax_layer,
Optimizer = my_optimizer, num_round = 200)
predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##
## 1 2 3
## 1 18 0 0
## 2 0 13 0
## 3 0 2 17
data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')
for (i in 1:30) {
if (i == 1) {
fc = mx.symbol.FullyConnected(data = data, num.hidden = 10, name = paste0('fc', i))
} else {
fc = mx.symbol.FullyConnected(data = relu, num.hidden = 10, name = paste0('fc', i))
}
bn = mx.symbol.BatchNorm(data = fc, axis = 1, name = paste0('bn', i))
relu = mx.symbol.Activation(data = bn, act.type = 'relu', name = paste0('relu', i))
}
fc_final = mx.symbol.FullyConnected(data = relu, num.hidden = 3, name = 'fc_final')
softmax_layer = mx.symbol.softmax(data = fc_final, axis = 1, name = 'softmax_layer')
eps = 1e-8
m_log = 0 - mx.symbol.mean(mx.symbol.broadcast_mul(mx.symbol.log(softmax_layer + eps), label))
m_logloss = mx.symbol.MakeLoss(m_log, name = 'm_logloss')
my_optimizer = mx.opt.create(name = "adam", learning.rate = 0.001, beta1 = 0.9, beta2 = 0.999,
epsilon = 1e-08, wd = 0)
model = my.model.FeedForward.create(Iterator = my_iter,
loss_symbol = m_logloss, pred_symbol = softmax_layer,
Optimizer = my_optimizer, num_round = 200)
predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##
## 1 2 3
## 2 18 15 17
– The root of the problem is that the weights are completely random at the very beginning, so you can picture the forward pass as constantly losing information; by 30 layers or deeper, that information is essentially indistinguishable from 0. Backpropagation, however, has to push gradient updates back through these errors, and so the optimization of the whole network fails.
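– A rough numeric sketch of this point in base R (names introduced here for illustration; BatchNorm and the biases are omitted for simplicity, and the weight scale mimics mx.init.uniform(0.01)): pushing one standardized sample through 30 freshly initialized FullyConnected + ReLU layers makes the activations shrink toward 0 almost immediately, with the exact numbers depending on the random draw.
set.seed(0)
h_act = TRAIN.X.array[, 1, drop = FALSE]                 # a 4 x 1 standardized input
in_dim = 4
for (j in 1:30) {
  W = matrix(runif(10 * in_dim, -0.01, 0.01), nrow = 10) # random weights near 0
  h_act = pmax(W %*% h_act, 0)                           # FullyConnected (no bias) + ReLU
  in_dim = 10
  if (j %in% c(1, 5, 10, 30)) {cat('layer', j, ': mean(abs(activation)) =', mean(abs(h_act)), '\n')}
}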
– So far, our only tools against vanishing gradients are the Adam optimizer from Lesson 3 and Batch Normalization. Adam's strength is that even when the scales of different layers differ greatly, it can still balance the update speed of each layer's parameters; Batch Normalization, in turn, mitigates the loss of gradient during backpropagation. But if the gradient becomes exactly 0, none of these methods will help.
– However, since 2012 the hot topic in AI research has been computer vision, so most of the breakthrough work is also in that field. We have not yet formally introduced the special developments of AI in computer vision, so the next lesson will first fill in that background, and then, building on it, we will look at how the most advanced research effectively solves the vanishing gradient problem!
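– The two claims above can be illustrated with a stylized Adam update in base R (adam_step is a name introduced here; it follows the standard Adam formulas, not MxNet's internal implementation): after bias correction the step size is almost the same whether the raw gradient is 1 or \(10^{-6}\), but a gradient of exactly 0 still yields a step of exactly 0.
adam_step = function(g, steps = 100, lr = 0.001, b1 = 0.9, b2 = 0.999, eps = 1e-8) {
  m = 0; v = 0
  for (t in 1:steps) {
    m = b1 * m + (1 - b1) * g                   # first-moment estimate
    v = b2 * v + (1 - b2) * g^2                 # second-moment estimate
    m_hat = m / (1 - b1^t); v_hat = v / (1 - b2^t)
  }
  lr * m_hat / (sqrt(v_hat) + eps)              # the final update step
}
c(big_grad = adam_step(1), tiny_grad = adam_step(1e-6), zero_grad = adam_step(0))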